Speech recognition method and system

ABSTRACT

Frames making up an input speech are each collated with a string of phonemes representing speech candidates to be recognized, whereby evaluation values regarding the phonemes are computed. The frames are each compared with part of the phoneme string so as to reduce computations and memory capacity required in recognizing the input speech based on the evaluation values. That is, each frame is compared with a portion of the phoneme string to acquire an evaluation value for each phoneme. If the acquired evaluation value meets a predetermined condition, part of the phonemes to be collated with the next frame are changed. Illustratively, if the evaluation value for the phoneme heading a given portion of collated phonemes is smaller than the evaluation value of the phoneme which terminates that phoneme portion, then the head phoneme is replaced by the next phoneme. The new portion of phonemes obtained by the replacement is used for collation with the next frame.

TECHNICAL FIELD

The present invention relates to a speech recognition method for recognizing input speech using phoneme and language models, as well as to a speech recognition system adopting that method.

BACKGROUND ART

Today, speech recognition functions and devices are finding their way into small-sized data apparatuses such as portable speech translators and personal digital assistants (PDAs), as well as into car navigation systems and many other appliances and systems.

A conventional speech recognition method typically involves storing phoneme and language models beforehand and recognizing input speech based on the stored models, as described illustratively in “Automatically Translated Telephone” (pp. 10-29, from Ohm-sha in Japan in 1994, edited by Advanced Telecommunications Research Institute International). A language model is made up of pronunciations of different words and syntax rules, whereas each phoneme model includes spectral characteristics of each of a plurality of speech recognition units. The speech recognition unit is typically a phoneme or a phoneme element that is smaller than a phoneme. The background art of this field will be described below with phonemes regarded as speech recognition units. Spectral characteristics stored with respect to each phoneme may sometimes be referred to as a phoneme model of the phoneme in question.

The language model determines a plurality of allowable phoneme strings. At the time of speech recognition, a plurality of phoneme model strings are generated corresponding to each of the allowable phoneme strings. The phoneme model strings are each collated with the input speech so that the phoneme model string of the best match is selected. In collating each phoneme model string with the input speech, the input speech is divided into segments called frames. The frames are each collated successively with a plurality of phoneme models constituting each phoneme model string so as to compute evaluation values representing similarities between the phoneme model in question and the input speech. This collating process is repeated with different phoneme model strings, and then with different frames. The evaluation values obtained by collating the phoneme models of each phoneme model string with a given frame of the input speech are also used in the collation of the next frame.

As outlined above, the conventional speech recognition method is slow because it involves collating all frames of the input speech with all phoneme models in all phoneme model strings. Furthermore, it is necessary to retain in memory, for collation of the next frame, the evaluation values acquired by collating the phoneme models in each phoneme model string with a given frame of the input speech. As a result, the amount of memory needed grows with the total number of phoneme model strings.

The so-called beam search method has been proposed as a way to reduce such prolonged processing time. This method involves, at the time of collating each frame of the input speech, limiting the phoneme models only to those expected to become final candidates for speech recognition. More specifically, checks are made on all phoneme model strings to see, based on the evaluation values computed in a given frame for all phoneme model strings, whether each of the phoneme models should be carried forward for collation in the next frame. There are a number of schemes to determine how to carry forward phoneme models: (1) a fixed number of phoneme models starting from the model of the highest evaluation value are carried forward; (2) an evaluation value threshold is computed so that only the phoneme models with their evaluation values higher than the threshold are carried forward; or (3) the above two schemes are used in combination.
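Illustratively, the three carry-forward schemes may be sketched as follows. This is a minimal sketch and not a definitive implementation: the function names, the sample scores, and in particular the rule of deriving the threshold as "best value minus a margin" are assumptions made for illustration; the text above states only that a threshold is computed.

```python
import heapq

def prune_top_k(scores, k):
    """Scheme (1): keep the k models with the highest evaluation values."""
    return set(heapq.nlargest(k, scores, key=scores.get))

def prune_by_threshold(scores, margin):
    """Scheme (2): keep models whose value exceeds a computed threshold.

    The threshold rule (best value minus a margin) is an assumption.
    """
    threshold = max(scores.values()) - margin
    return {name for name, value in scores.items() if value > threshold}

def prune_combined(scores, k, margin):
    """Scheme (3): apply the threshold first, then cap the survivors at k."""
    survivors = prune_by_threshold(scores, margin)
    return prune_top_k({name: scores[name] for name in survivors}, k)

# Hypothetical evaluation values for four phoneme models in one frame.
scores = {"a": 37.0, "b": 33.0, "c": 12.0, "d": 35.5}
kept_top = prune_top_k(scores, 2)
kept_threshold = prune_by_threshold(scores, 5.0)
kept_combined = prune_combined(scores, 2, 5.0)
```

Scheme (1) requires a selection over all evaluation values and scheme (2) requires a pass to find the reference value, which is the overhead discussed in the following section.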

DISCLOSURE OF INVENTION

One disadvantage of the conventional beam search method is that it takes time to select phoneme models. That is, scheme (1) above of carrying forward a fixed number of phoneme models starting from the model of the highest evaluation value must sort the evaluation values of all phoneme models. Sorting generally takes time. According to scheme (2) above whereby only the phoneme models with their evaluation values higher than a threshold are carried forward, it also takes time to compute the threshold value.

It is therefore an object of the present invention to provide a speech recognition method suitable for minimizing computing time and for reducing the required memory capacity, and a speech recognition system adopting that method.

In carrying out the invention and according to one aspect thereof, there is provided a speech recognition method for collating a portion of speech (e.g., frame) with part of a plurality of speech recognition units (e.g., phonemes or phoneme elements) representing speech candidates. Depending on the result of the collation with the current speech portion, the method dynamically selects that part of speech recognition units which is to be collated with the next speech portion. Because only the necessary parts of speech recognition units are collated, the processing time and memory area for collation purposes are significantly reduced.

The inventive speech recognition method comprises the steps of:

(a) collating one of the plurality of speech candidates successively with an ordered plurality of speech parts obtained by dividing the target speech; and

(b) performing the step (a) on another plurality of speech candidates;

wherein the step (a) includes the steps of:

(a1) determining a plurality of likelihoods representing similarities between one of the ordered plurality of speech parts on the one hand, and a portion of speech recognition units constituting part of an ordered plurality of speech recognition units representing one of the plurality of speech candidates on the other hand;

(a2) determining a plurality of evaluation values representing similarities between the portion of speech recognition units and the target speech, based on the plurality of likelihoods determined in the step (a1) and on a plurality of transition probabilities corresponding to different combinations of the portion of speech recognition units; and

(a3) determining, based on the determined plurality of evaluation values, a new portion of speech recognition units for use with the next speech part in the ordered plurality of speech parts;

wherein the new portion of speech recognition units is used when the step (a) is carried out on the next speech part in the ordered plurality of speech parts.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic block diagram of a speech recognition system that uses a speech recognition method according to the invention;

FIG. 2 is a schematic flowchart of steps constituting a speech recognition program used by the system of FIG. 1;

FIG. 3 is a schematic flowchart of steps detailing a collating process (step 207) in the flowchart of FIG. 2;

FIG. 4 is a schematic flowchart of steps detailing an evaluation value computing process (step 404) and a collation starting position updating process (step 406) of FIG. 3;

FIG. 5 is a schematic view showing a conventional procedure for computing evaluation values regarding speech model strings;

FIG. 6 is a schematic view illustrating a procedure for computing evaluation values using transition probabilities; and

FIG. 7 is a schematic view depicting a procedure of the invention for computing evaluation values regarding speech model strings.

BEST MODE FOR CARRYING OUT THE INVENTION

In FIG. 1, reference numeral 101 stands for a speech input microphone; 102 for an amplifier and an A/D converter; 103 for a FIFO buffer that temporarily holds the input speech; 104 for a dictionary/syntax file that stores a vocabulary of words or the like representing speech candidates to be recognized, as well as syntax rules; and 105 for a phoneme model file that stores phoneme models with respect to a plurality of speech recognition units. The files 104 and 105 are each implemented typically in the form of a ROM such as a semiconductor ROM or a CD-ROM. With this embodiment of the invention, phonemes are used as speech recognition units. Also in FIG. 1, reference numeral 107 stands for a read-only memory (ROM) that stores a speech recognition program; 108 for a random access memory (RAM) used by the program as a work area; and 109 for any one of external interface circuits for transferring recognition result data to a display device (not shown) or to some other device over a transmission line. Reference numeral 106 denotes a microprocessor (CPU) that controls the above-mentioned circuits and memories through a bus 110 or signal lines not shown. Of the configured devices, those except for the microphone 101 should preferably be fabricated on a single semiconductor chip by use of integrated circuit technology.

When initialized by a POWER-ON-RESET or like command, the CPU transfers the speech recognition program from the ROM 107 to the RAM 108. The program is transferred so as to take advantage of the RAM 108 affording higher access speeds than the ROM 107. After the program transfer, the CPU carries out the transferred program.

How the speech recognition program works is described below with reference to the flowchart of FIG. 2. When the program is started, the phoneme model file 105 is read into the RAM 108 (step 201). The phoneme model file 105 contains characteristic vectors obtained by analyzing each of a plurality of phonemes used as speech recognition units. The characteristic vectors are the same as those generated for input speech, to be described later. With this embodiment, speech recognition is carried out on the commonly utilized hidden Markov model (called the HMM model hereunder). For the speech recognition pursuant to the HMM model, a phoneme model associated with each phoneme has transition probabilities regarding transition from the phoneme in question to other phonemes.

The dictionary/syntax file 104 is then read into the RAM 108 (step 202). The dictionary/syntax file 104 contains a vocabulary of words or the like with which to recognize a target speech. Each word is composed of a character string representing a plurality of speech recognition units making up the word to be recognized. More specifically, each word is made up of a series of alphabetic characters denoting a group of phonemes constituting the word to be recognized. For example, a name “SUZUKI” is represented by a string of three phonemes “su, zu, ki.” Although the file 104 also includes syntax rules, descriptions of any speech recognition procedures using syntax rules are omitted, and speech recognition processing by use of only words will be described below.

Each of the words in the dictionary/syntax file 104 is translated into the corresponding phoneme model string (step 203). The translation replaces each of the phonemes constituting each word read in step 202 by a phoneme model corresponding to each phoneme read in step 201. This provides a phoneme model string corresponding to the phoneme string making up each word.

The speech input through the microphone 101 is amplified by the amplifier and A/D converter 102 before being translated into a digital signal. The digital signal thus acquired is sampled in increments of a predetermined time unit through the FIFO buffer 103. Speech data sampled at several points are sent collectively to the RAM 108 (step 205). The collective speech data of several points are called a frame. A frame generally refers to speech data over a period of 15 to 30 ms. The next frame is usually generated from the input speech by a shift over a time period shorter than one frame time (e.g., 5 to 20 ms). The spectrum of each frame is analyzed, and a characteristic vector string denoting characteristics of the frame in question is created (step 206). The commonly utilized linear predictive coding (LPC) is used for the analysis, and an LPC cepstrum is generated as a characteristic parameter. However, this is not limitative of the invention. Other speech analyses may be used instead, and alternative characteristic vectors such as LPC delta cepstrum, mel-cepstrum or logarithmic power may also be used.
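Illustratively, the division of sampled speech into overlapping frames may be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the 8 kHz sampling rate, and the 20 ms frame length with a 10 ms shift are hypothetical values chosen from within the ranges given above, and the spectral analysis of step 206 is not shown.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20, shift_ms=10):
    """Cut a sample sequence into fixed-length frames that overlap.

    frame_ms and shift_ms are assumed values within the 15 to 30 ms and
    5 to 20 ms ranges mentioned in the text.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    frames = []
    start = 0
    # Stop once a full frame no longer fits in the remaining samples.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

# Hypothetical input: 100 ms of samples at an assumed 8 kHz rate.
samples = list(range(800))
frames = split_into_frames(samples, 8000)
```

Because the shift is shorter than the frame length, consecutive frames share samples, which corresponds to the overlapping frame generation described above.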

Based on the generated characteristic vector string and on the phoneme model string constituting each of the words obtained in step 203, each of the phonemes making up each word is collated with the input frame in a manner stipulated by the HMM model. More specifically, likelihoods between each of the phonemes and the input frame are computed. Based on the likelihoods thus acquired, a plurality of evaluation values are computed which represent similarities between each of the phonemes making up the phoneme string corresponding to the word in question on the one hand, and the input frame on the other hand (step 207). Details of the computations will be explained later. The collating process is carried out on every word contained in the dictionary/syntax file 104. Thereafter, steps 205 through 207 are repeated on subsequent frames. In step 204 reached during the repetitive processing, a check is made to see if the input speech has ended on the basis of its power value. If the input speech is judged to have ended, the word including the phoneme of the highest evaluation value is selected (step 208). The selected word is transferred to a display device or a data processor, not shown, through the external interface circuit 109 (step 209). In the collating step 207, a known forward computation algorithm is used to compute evaluation values regarding the phonemes in each phoneme model string. This embodiment is characterized by a function to restrict the phoneme models of the phonemes for which evaluation values are computed.

Below is a description of a conventional method for computing evaluation values using a forward computation algorithm, followed by a description of the inventive method for computing evaluation values using the same forward computation algorithm.

FIG. 5 is a trellis chart showing a conventional method that uses the forward computation algorithm in computing evaluation values of each phoneme model within each phoneme model string. The speech recognition based on the HMM model regards each of a plurality of phoneme models constituting a phoneme model string corresponding to one word as representative of one state. In FIG. 5, states 1 through 4 are vertically shown to represent four phoneme models making up a phoneme model string corresponding to each word. The four states are ordered the same way as the phoneme models constituting the phoneme model string in question. For example, states 1 through 4 correspond to a first through a fourth phoneme model which appear in the phoneme model string. In FIG. 5, successively input frames are presented in the horizontal direction.

Each of the circles in FIG. 5 denotes a combination of a frame with a state, and each encircled number represents an evaluation value of the phoneme model corresponding to the state in question. Shown at the top right of each circle is a likelihood between the frame in question and its phoneme model. How such likelihoods are computed will be explained later. A number on a rightward arrow coming out of each circle denotes the probability that the state in question will change to the same state in the next frame. Such transition probabilities are determined in advance independently of input frames. Likewise, a number on an arrow from each circle pointing to bottom right denotes the probability that the state in question will change to the next state in the next frame. Such transition probabilities are also determined beforehand independently of input frames.

The evaluation value of each state for frame 0 is given as an initial value. Because frame 0 is considered to head an input speech signal, an initial value of 0 is assigned only to the evaluation value of the phoneme model (state 1) which heads the phoneme model string constituting the target word to be recognized; evaluation values of the other phoneme models are assigned “−∞” as an initial value each. The initial values are thus established on the assumption that the first frame of an input speech always matches the phoneme model that heads a phoneme model string. The evaluation value of each of the states in frame 1 and subsequent frames is determined based on the likelihood regarding the preceding frame and the state associated therewith, and on the transition probability defined beforehand about the state in question. How such determination takes place is described below.

In FIG. 6, it is assumed that state “i” and state “j” in the phoneme model string constituting a given word are assigned evaluation values A and B, respectively, with regard to frame “n.” These evaluation values are either determined by a collating process in the preceding frame or given as initial values regarding the states in question. Likelihoods between frame “n” and states “i” and “j” are computed as similarities between the phoneme model corresponding to each state on the one hand and frame “n” on the other hand. More specifically, the similarities are represented in a known manner by distances between the characteristic vector of the phoneme model corresponding to each state on the one hand, and the characteristic vector obtained by analyzing frame “n” on the other hand. In practice, such distances are typically Euclidean distances that are each given in a known manner by the squared sum of coordinate differences in different dimensions between two vectors. If each dimension requires normalization, acquisition of the squared sum is preceded by normalization of each dimension using a predetermined coefficient. If the above-mentioned LPC cepstrum is used as the characteristic vector, these coefficients are cepstrum coefficients. With this embodiment, the likelihoods between states “i” and “j” on the one hand and frame “n” on the other hand are assumed as Ni and Nj respectively.
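Illustratively, the squared-sum distance with optional per-dimension normalization may be sketched as follows. This is a minimal sketch under stated assumptions: the function name and sample vectors are hypothetical, and applying the coefficient to each coordinate difference before squaring is one possible reading of the normalization described above. How a distance is converted into the likelihood values Ni and Nj is not specified in the text and is therefore not shown.

```python
def squared_distance(model_vector, frame_vector, coefficients=None):
    """Squared sum of coordinate differences between two vectors.

    coefficients, if given, scale each dimension before squaring
    (an assumed form of the per-dimension normalization).
    """
    if len(model_vector) != len(frame_vector):
        raise ValueError("vectors must have the same number of dimensions")
    if coefficients is None:
        coefficients = [1.0] * len(model_vector)
    return sum(
        (c * (a - b)) ** 2
        for a, b, c in zip(model_vector, frame_vector, coefficients)
    )

# Hypothetical three-dimensional characteristic vectors.
plain = squared_distance([1.0, 2.0, 3.0], [1.0, 0.0, 5.0])
scaled = squared_distance([1.0, 2.0, 3.0], [1.0, 0.0, 5.0], [1.0, 0.5, 2.0])
```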

An evaluation value C of state “i” regarding the next frame “n+1” is represented, if no state precedes that state “i,” by the sum of the evaluation value A of state “i” in frame “n,” the likelihood Ni and a transition probability Pii from state “i” to state “i.” The evaluation value of state “j” regarding frame “n+1” is computed as follows: suppose that a transition is effected from state “i” in frame “n” to state “j” in frame “n+1.” In that case, an evaluation value Di of state “j” in frame “n+1” is given as the sum of the evaluation value A of state “i” in frame “n,” likelihood Ni of state “i” in frame “n,” and a transition probability Pij from state “i” to state “j.” If it is assumed that a transition is effected from state “j” in frame “n” to state “j” in frame “n+1,” then an evaluation value Dj of state “j” in frame “n+1” is given as the sum of the evaluation value B of state “j” in frame “n,” likelihood Nj of state “j” in frame “n,” and a transition probability Pjj from state “j” to state “j.” Eventually, the greater of the evaluation values Di and Dj is adopted as the evaluation value for state “j” in frame “n+1.” While likelihoods about states “i” and “j” in frame “n+1” are also computed, they are used to calculate evaluation values of these states in the next frame “n+2.”
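Illustratively, one frame-to-frame update of this kind may be sketched as follows for a string in which each state can only stay or advance to the next state. This is a minimal sketch, not a definitive implementation: the function and parameter names are hypothetical, states are indexed from 0, and the transition values 3, 8 and 7 are assumptions (the actual values appear only in the figures). The values 3 and 7 were chosen so that the result agrees with the evaluation values 33 and 37 quoted for frame 1 in the description of FIG. 7.

```python
NEG_INF = float("-inf")

def forward_step(values, likelihoods, stay, advance):
    """Compute evaluation values for frame n+1 from those of frame n.

    values[s]      evaluation value of state s in frame n
    likelihoods[s] likelihood of state s in frame n
    stay[s]        transition value from state s to state s
    advance[s]     transition value from state s to state s+1
    """
    next_values = []
    for j in range(len(values)):
        # Candidate Dj: remain in state j.
        best = values[j] + likelihoods[j] + stay[j]
        if j > 0:
            # Candidate Di: arrive from the preceding state i = j-1.
            arrive = values[j - 1] + likelihoods[j - 1] + advance[j - 1]
            best = max(best, arrive)
        next_values.append(best)
    return next_values

# Frame 0 initial values and likelihoods for two states.
frame1_values = forward_step(
    values=[0, NEG_INF],
    likelihoods=[30, 20],
    stay=[3, 8],      # assumed transition values
    advance=[7, 0],   # assumed; the last entry is unused
)
```

The first state has no predecessor, so only its own candidate is considered; every other state takes the greater of the two candidates.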

FIG. 5 shows evaluation values computed in the above-described manner ranging from frame 0 to frame 5. When the last frame is reached, the largest of a plurality of evaluation values regarding a given word is adopted as the evaluation value of the word in question. In the example of FIG. 5, a value 319 denotes the evaluation value of the word. Although the example of FIG. 5 assumes only one of two states (i.e., the same state or the next state) as the destination of the transition starting from each state, a transition from a given state may in fact occur to any one of a larger number of states. In this case, the evaluation values of each state following the transition are also computed.

Referring to the trellis chart obtained in the manner described above, the state with the largest evaluation value is selected for each frame, and the states thus selected for different frames make up a path known as a Viterbi path. In the example of FIG. 5, the path connecting (frame 0, state 1), (frame 1, state 2), (frame 2, state 2), (frame 3, state 3), (frame 4, state 3) and (frame 5, state 4) constitutes a Viterbi path. The state having the highest evaluation value in a given frame represents the phoneme of the highest similarity in the word being processed with respect to the frame in question. It follows that the string of states linked by a Viterbi path denotes the phoneme string similar to the input speech with regard to the word being processed.

As described above, the conventional method for evaluation value computation requires computing the evaluation values of all states corresponding to target phoneme model strings regarding all frames. This means that the number of computations, defined by the expression shown below, increases as the number of words and the number of frames become larger. This requires installing a growing quantity of memory for accommodating interim evaluation values. The expression is:

No. of computations = No. of frames × No. of words × average No. of phoneme models regarding each word

This embodiment of the invention alleviates the above problem by having recourse to the collating step 207 (FIG. 2). The step involves limiting target phoneme models so that evaluation values are computed regarding only part of all phoneme models constituting the phoneme model string corresponding to each word. The collating step 207 is explained below.

In carrying out the collating step 207, as shown in FIG. 3, evaluation values are computed between an input frame and each of a selected group of phoneme models constituting a phoneme model string corresponding to the word in question (step 404).

Of the phoneme models in the phoneme model string corresponding to each word, a predetermined plurality of phoneme models (m+1) which head the phoneme model string are selected for collation with the first frame. Given the result of step 404, the phoneme model heading the group of phoneme models for collation with the next frame is determined within the phoneme model string with respect to the same word (step 406). Then the next word is selected for the collation (step 407). Steps 404 and 406 are repeated until all words in the dictionary/syntax file 104 have been exhausted (step 401).

More specifically, as shown in FIG. 4, step 404 is started by checking to see if the frame being processed is the first frame (step 801). If the current frame is found to be the first frame, then a value of 1 is set for position “n” of the phoneme model from which to start the collation within the phoneme model string corresponding to the word in question (step 802). That is, the collation is set to begin from the phoneme model that heads the phoneme model string. If the current frame is not the first frame, then the value of the collation starting position “n” determined in step 406 is used for the frame. Based on the n-th phoneme model thus determined, step 803 is carried out to compute evaluation values Pn(I) through Pn+m(I) regarding each of the n-th through (n+m)-th phoneme models with respect to the input frame. Reference character I denotes a frame number. The computations involved here are performed as per the conventional method described with reference to FIG. 5. Likelihoods between each of the phoneme models on the one hand and the input frame on the other hand are also computed according to the conventional method.

Later, step 406 is executed to determine the collation starting position in the next frame regarding the same word by use of the computed evaluation values. More specifically, a comparison is made between evaluation values Pn(I) and Pn+m(I) with respect to the n-th and (n+m)-th phoneme models positioned on both ends of the collated group of (m+1) phoneme models (step 805). If the evaluation value Pn+m(I) is found to be larger than the evaluation value Pn(I), then the collation starting position “n” for the next frame is incremented by 1 (step 806). That is because if the result of check 805 is affirmative, then the input frame is judged to be already less similar to the n-th phoneme model than to a subsequent phoneme model. If the result of the check 805 is negative, the collation starting position “n” remains unchanged. As discussed above with reference to FIG. 5, a Viterbi path has only to be distinguished correctly and evaluation values need only be computed precisely with regard to the states (phoneme models) on that path. If the result of the check 805 is found to be affirmative, that means the Viterbi path in the trellis chart now passes through a phoneme model subsequent to the n-th phoneme model in the frame being processed. It is thus expected that in the ensuing frames, omitting the computation of evaluation values regarding the n-th phoneme model will not result in erroneous computations of evaluation values regarding the word being processed.

The steps above are repeated on each of the subsequent frames. Step 805 is preceded by step 804 which determines whether there exists any other state that may be changed into a new target for collation. More specifically, a check is made in step 804 to see if the number (n+m) of the phoneme model at the end of the current group of phoneme models being collated is equal to the total number of states regarding the word being processed. That is, a check is made to see if the (n+m)-th phoneme model at the end of the group of phoneme models being collated is the last phoneme model in the phoneme model string with respect to the word being processed. If the result of the check in step 804 is affirmative, steps 805 and 806 will not be carried out. Thus if any new frame is subsequently input, evaluation values are continuously computed regarding (m+1) phoneme models at the end of the phoneme model string. This completes the collation of one frame with the phoneme model string corresponding to one word.
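Illustratively, steps 801 through 806 may be sketched for one word as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: all names are hypothetical, states are indexed from 0 (so the starting position begins at 0 rather than 1), each state is assumed to either stay or advance by one, evaluation values outside the previous group are treated as −∞, and the likelihood table and transition values are invented for illustration and do not reproduce the figures.

```python
NEG_INF = float("-inf")

def collate_word(likelihood, num_frames, num_states, stay, advance, m=1):
    """Collate one word using a sliding group of (m+1) states.

    likelihood(frame, state) returns the likelihood of a state in a frame.
    Returns (largest evaluation value in the last frame, final start n).
    """
    if num_frames < 1:
        raise ValueError("at least one frame is required")
    n = 0        # steps 801/802: start from the head of the string
    prev = {}    # state -> (evaluation value, likelihood) in previous frame
    for frame in range(num_frames):
        window = range(n, min(n + m + 1, num_states))
        values = {}
        for s in window:  # step 803
            if frame == 0:
                values[s] = 0.0 if s == 0 else NEG_INF
                continue
            candidates = []
            if s in prev:          # stayed in state s
                v, lik = prev[s]
                candidates.append(v + lik + stay[s])
            if s - 1 in prev:      # advanced from state s-1
                v, lik = prev[s - 1]
                candidates.append(v + lik + advance[s - 1])
            values[s] = max(candidates, default=NEG_INF)
        # Likelihoods are computed only for the states in the group.
        prev = {s: (values[s], likelihood(frame, s)) for s in window}
        last = n + m
        # Step 804: skip the comparison once the group reaches the end.
        # Step 805: compare both ends; step 806: advance the start by one.
        if last < num_states - 1 and values[last] > values[n]:
            n += 1
    return max(values.values()), n

# Hypothetical likelihoods: 6 frames (rows) by 4 states (columns).
table = [
    [30, 20, 0, 0],
    [10, 40, 0, 0],
    [0, 50, 40, 0],
    [0, 10, 90, 0],
    [0, 0, 70, 30],
    [0, 0, 10, 10],
]
result = collate_word(lambda f, s: table[f][s], 6, 4,
                      stay=[3, 3, 3, 3], advance=[7, 7, 7, 7], m=1)
```

Only (m+1) evaluation values and likelihoods are held per frame, regardless of the length of the phoneme model string.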

Illustratively, FIG. 7 is a trellis chart in effect when m+1=2 between the phoneme model string shown in FIG. 5 and the input frame string 0, 1, etc., in the figure. Of states 1 through 4, only two states (m+1=2) are subjected to the following steps:

In the first frame 0, the result of the check in step 801 is affirmative. In step 802, the collation starting position “n” is set to 1 so that states 1 and 2 are selected to be collated. In step 803, evaluation values P1(0) and P2(0) of these states are not computed. Instead, initial values 0 and −∞ are used unmodified as the evaluation values of states 1 and 2. Step 803 is carried out to compute likelihoods of states 1 and 2 in frame 0. The likelihoods are assumed here to be 30 and 20 for states 1 and 2 respectively. The result of the check in step 804 is negative with respect to the current group of phoneme models being collated. For the above two evaluation values, the result of the check in step 805 is negative. Thus the collation starting position “n” remains unchanged and the collation of the current word in frame 0 is terminated.

If the same word is collated with the next frame 1, the result of the check in step 801 is negative. In step 803, evaluation values P1(1) and P2(1) for the first and second phoneme models are computed. These evaluation values are assumed here to be 33 and 37 for the first and second phoneme models respectively. In step 803, likelihoods between frame 1 on the one hand and states 1 and 2 on the other hand are also computed; the likelihoods are assumed to be 10 and 40 for states 1 and 2 respectively. The result of the check in step 804 is negative for the current group of phoneme models being collated. Because the result of the check in step 805 is affirmative for the two evaluation values above, step 806 is carried out to update the collation starting position “n” to 2.

A comparison between the evaluation values for states 1 and 2 above shows that the evaluation value for state 2 is the greater of the two. That is, the input frame is judged to be more similar to state 2 than to state 1. If the input frame actually matches state 2 at this point, then the evaluation value of state 1 is deemed not to affect the probability of the ultimate state in the word in question as far as the Viterbi path search is concerned. Therefore the next state is reached in which to start collation on the next frame 2.

If the same word is collated with the next frame 2, the result of the check in step 801 is negative. Since the collation starting position “n” has been updated to 2, step 803 is carried out to compute evaluation values P2(2) and P3(2) regarding the second and third phoneme models; the evaluation values are assumed here to be 85 and 84 for the second and third phoneme models respectively. In step 803, likelihoods between frame 2 on the one hand and states 2 and 3 on the other hand are also computed; the likelihoods are assumed here to be 50 and 40 for states 2 and 3 respectively. For the current group of phoneme models being collated, the result of the check in step 804 is negative. Since the result of the check in step 805 is negative with respect to the two evaluation values above, step 806 is not executed, and the collation starting position “n” remains 2.

If the same word is collated with the next frame 3, the result of the check in step 801 is negative. Since the collation starting position “n” is still 2, step 803 is carried out to compute evaluation values P2(3) and P3(3) regarding the second and third phoneme models; the evaluation values are assumed here to be 142 and 143 for the second and third phoneme models respectively. In step 803, likelihoods between frame 3 on the one hand and states 2 and 3 on the other hand are also computed; the likelihoods are assumed to be 10 and 90 for states 2 and 3 respectively. For the current group of phoneme models being collated, the result of the check in step 804 is negative. Because the result of the check in step 805 is affirmative for the two evaluation values above, step 806 is performed to update the collation starting position “n” to 3.

If the same word is collated with the next frame 4, the result of the check in step 801 is negative. Since the collation starting position “n” has been updated to 3, step 803 is carried out to compute evaluation values P3(4) and P4(4) regarding the third and fourth phoneme models; the evaluation values are assumed here to be 241 and 240 for the third and fourth phoneme models respectively. In step 803, likelihoods between frame 4 on the one hand and states 3 and 4 on the other hand are also computed; the likelihoods are assumed to be 70 and 30 for states 3 and 4 respectively. For the current group of phoneme models being collated, the result of the check in step 804 is negative. Because the result of the check in step 805 is negative for the two evaluation values above, step 806 is not carried out and the collation starting position “n” remains 3.

If the same word is collated with the next frame 5, the result of the check in step 801 is negative. Since the collation starting position “n” is still 3, step 803 is carried out to compute evaluation values P3(5) and P4(5) regarding the third and fourth phoneme models; the evaluation values are assumed here to be 318 and 319 for the third and fourth phoneme models respectively. In step 803, likelihoods between frame 5 on the one hand and states 3 and 4 on the other hand are also computed. These likelihoods are omitted from FIG. 7. Because the result of the check in step 804 is affirmative for the current group of phoneme models being collated, steps 805 and 806 will not be carried out. The collation starting position “n” remains 3. If there exists any subsequent frame, the same processing as that on frame 5 is carried out.

The evaluation value for the word being processed with respect to the input speech up to frame 5 is the highest of all values obtained so far (319 in this example). This value is the same as that acquired by the conventional method shown in FIG. 5. However, as is evident from the above computations, the embodiment of the invention computes, for a given frame, evaluation values and likelihoods regarding only a predetermined number (m+1) of phoneme models (or states) among all models constituting the phoneme model string for a given word (or all states with regard to the word). Thus if the average total number of phoneme models regarding each word is illustratively 10 through 12 and if m+1=2, then the number of computations by the embodiment becomes about one-fifth or one-sixth of the number of computations required by the conventional method in FIG. 5. Correspondingly, the necessary capacity of buffers for accommodating interim computation results is about one-fifth or one-sixth of the buffer capacity required by the conventional method in FIG. 5. The inventive method is also more advantageous than the conventional beam search method in terms of computation count and required memory capacity.
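The one-fifth and one-sixth figures follow from the ratio of the group size (m+1) to the total number of phoneme models per word. A minimal check of that arithmetic, assuming that the per-frame cost is proportional to the number of models evaluated and ignoring the overhead of steps 804 through 806 (the function name `collated_fraction` is hypothetical):

```python
from fractions import Fraction


def collated_fraction(group_size, total_models):
    """Fraction of a word's phoneme models evaluated per frame."""
    return Fraction(group_size, total_models)


# m + 1 = 2 with 10 and 12 phoneme models per word, as in the text.
for total in (10, 12):
    print(total, collated_fraction(2, total))
# prints "10 1/5" then "12 1/6"
```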

Variations

The above-described embodiment is for illustrative purposes only and not limitative of the invention. Changes and variations may be made without departing from the spirit or scope of the invention. A few variations are described below.

(1) The judging step 805 (FIG. 4) for determining whether or not to change the phoneme model to be collated may be replaced by the following process: of a group of the n-th through (n+m)th phoneme models having evaluation values Pn(I) through Pn+m(I), the phoneme model with the highest evaluation value is detected. A check is made to see if the detected phoneme model is located past the middle of the phoneme model group toward the end of the group. That is, if the phoneme model having the highest evaluation value is the (n+g)th phoneme model, then a check is made to see if g>m/2. If the phoneme model with the highest evaluation value is judged to be located past the middle of the phoneme model group toward the group end, then step 806 (FIG. 4) is carried out to update the collation starting position “n” by 1 with respect to the next frame. If “m” is equal to 1, the result of the check by this variation is the same as that by the embodiment described above. In other words, if the result of the check by this variation is to differ from that by the above embodiment, then (m+1) must be greater than 2. The process of this variation for determining whether or not to change the phoneme model to be collated is more accurate than step 805 of the above embodiment in judging the need to change the collation starting position. It should be noted however that the above embodiment is simpler in computing procedures than this variation of the embodiment.
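The check of variation (1) can be sketched as follows. This is a minimal illustration, not the specification's implementation: the function name `should_advance` is hypothetical, ties are resolved in favor of the earliest phoneme model (an assumption), and the four-model evaluation values in the last two calls are invented for illustration.

```python
def should_advance(window_values):
    """Variation (1): advance when the best model lies past the middle.

    window_values -- evaluation values [Pn, ..., Pn+m] for the current frame
    """
    m = len(window_values) - 1
    # Offset g of the model with the highest evaluation value
    # (the first such model on ties; this tie rule is an assumption).
    g = max(range(m + 1), key=lambda i: window_values[i])
    return g > m / 2


# With m = 1 the outcome matches step 805 of the embodiment.
print(should_advance([142, 143]))  # True
print(should_advance([241, 240]))  # False

# With m = 3 (hypothetical values) the two rules can disagree: the head is
# smaller than the last value, yet the best model is before the middle.
print(should_advance([10, 50, 20, 30]))  # False
print(should_advance([10, 20, 50, 30]))  # True
```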

(2) A modification of the variation (1) above may involve updating the collation starting position “n” not by 1 but by a value such that the phoneme model with the highest evaluation value is located approximately in the middle of the group of phoneme models being collated. In this case, too, (m+1) must be greater than 1. This modification is more accurate than the corresponding step or process of the above embodiment or variation in judging the need for changing the phoneme model to be collated.
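One way to realize modification (2) is sketched below. The details are assumptions rather than part of the specification: the shift is taken as g minus the integer middle offset, it is applied only when the check of variation (1) succeeds, and the new starting position is clipped so that the group never runs past the end of the phoneme model string. The names `next_start_centered` and `total_models` are hypothetical, as are the evaluation values.

```python
def next_start_centered(n, window_values, total_models):
    """Modification (2): shift so the best model sits near the group middle.

    n             -- current starting position (1-based)
    window_values -- evaluation values [Pn, ..., Pn+m] for the current frame
    total_models  -- number of phoneme models in the word (assumed known)
    """
    m = len(window_values) - 1
    g = max(range(m + 1), key=lambda i: window_values[i])
    if g <= m / 2:
        return n  # best model is not past the middle: no change
    shift = g - m // 2  # moves the best model to offset m // 2
    last_valid_start = total_models - m  # group must stay inside the string
    return max(n, min(n + shift, last_valid_start))


# Five-model group (m = 4) starting at n = 1; the best model is the 5th.
print(next_start_centered(1, [5, 6, 7, 8, 20], total_models=12))  # 3
# Near the end of a 12-model string the shift is clipped.
print(next_start_centered(7, [5, 6, 7, 8, 20], total_models=12))  # 8
```

In the first call the new group covers models 3 through 7, so the best model (the 5th) lies in its middle.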

(3) In the above-described embodiment, variations and modifications, speech recognition units were assumed to be phonemes. That is, the dictionary/syntax file 106 contains character strings that represent phoneme strings constituting words. The phoneme model file 105 holds HMM models of various phonemes. The speech recognition program utilizes these files in generating the phoneme model string corresponding to each word. Alternatively, the invention may also be applied to speech recognition systems that employ phoneme elements (i.e., units smaller than phonemes) as speech recognition units. More specifically, the phoneme model file 105 may contain models regarding the phoneme elements smaller than phonemes. For example, a phoneme “su” is replaced by phoneme elements “ss” and “su” in memory, a phoneme “zu” by phoneme elements “zz” and “zu,” and a phoneme “ki” by phoneme elements “kk” and “ki.” The speech recognition program generates a phoneme element string “ss,” “su,” “zz,” “zu,” “kk” and “ki” with regard to the word “su zu ki.” In this case, too, each of the phoneme elements is regarded as one of the states used by the above-described embodiment of the invention.
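The expansion of a phoneme string into a phoneme element string can be illustrated as follows. The dictionary `PHONEME_ELEMENTS` and the function `element_string` are hypothetical stand-ins for the contents of the files 105 and 106; only the three mappings given above are included.

```python
# Phoneme-to-element mappings taken from the example in the text.
PHONEME_ELEMENTS = {
    "su": ["ss", "su"],
    "zu": ["zz", "zu"],
    "ki": ["kk", "ki"],
}


def element_string(phonemes):
    """Expand a list of phonemes into the corresponding phoneme elements."""
    elements = []
    for phoneme in phonemes:
        elements.extend(PHONEME_ELEMENTS[phoneme])
    return elements


print(element_string(["su", "zu", "ki"]))
# ['ss', 'su', 'zz', 'zu', 'kk', 'ki']
```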

As described, the method and system according to the invention shorten the time required to collate a plurality of speech recognition units with a given portion of the target input speech to be recognized. The inventive method and system also reduce the capacity of memory needed to accommodate computation results.

What is claimed is:
 1. A speech recognition method for collating a target speech with each of a plurality of speech candidates in order to recognize said target speech, said speech recognition method comprising the steps of: (a) collating one of said plurality of speech candidates successively with an ordered plurality of speech frames obtained by dividing said target speech; and (b) performing said step (a) on another plurality of speech candidates; wherein said step (a) includes the steps of: (a1) determining a plurality of likelihoods representing similarities between one of said ordered plurality of speech frames, and consecutive ones of ordered phonemes in a plurality of phoneme strings representing one of said plurality of speech candidates; (a2) determining a plurality of evaluation values representing similarities between said consecutive ones of said ordered phonemes and said target speech, based on said plurality of likelihoods determined in said step (a1) and on a plurality of transition probabilities corresponding to different combinations of said consecutive ones of the ordered phonemes; and (a3) if said evaluation value for the head phoneme in said consecutive ones of said ordered phonemes is smaller than that of the last phoneme in said consecutive ones of said ordered phonemes, replacing phonemes to be collated for the next one of said speech frames with new consecutive ones of said ordered phonemes, wherein said new consecutive ones are said consecutive ones with the head phoneme removed therefrom and with the next phoneme in the corresponding phoneme string added to said consecutive ones; wherein said new consecutive ones of the ordered phonemes are used when said step (a) is carried out on said next speech frame in said ordered plurality of speech frames.
 2. A speech recognition method according to claim 1, wherein said step (a3) includes the steps of: if said evaluation value for the head phoneme in said consecutive ones of said ordered phonemes is not smaller than that of the last phoneme in said consecutive ones of said ordered phonemes, then determining said consecutive ones of the ordered phonemes with no modification as said new consecutive ones of the ordered phonemes.